Extensive research efforts have been dedicated to 3D model retrieval in recent decades. Recently,
view-based methods have attracted much research attention because multiple views are highly
discriminative for 3D object representation. In this report, we summarize view-based 3D model
retrieval methods and outline further research trends. The report focuses on schemes for matching
between multiple views of 3D models and on the application of the bag-of-visual-words method to
3D model retrieval. For matching between multiple views, many-to-many matching, probabilistic
matching, and semi-supervised learning methods are introduced. For the application of
bag-of-visual-words to 3D model retrieval, we first briefly review bag-of-visual-words work on
multimedia and computer vision tasks, where the visual dictionary is introduced in detail. We then
survey a series of 3D model retrieval methods that use the bag-of-visual-words description.
Finally, we summarize directions for further research in view-based 3D model retrieval.

I. Introduction

With the fast development of internet technology, computer hardware, and software, 3D models 
[1,2] have been widely used in many applications, such as computer graphics, computer vision, 
CAD and medical imaging. Effective and efficient 3D model retrieval [3,4,6,54,65,
68,78,80,81] has attracted much research attention in recent years.

3D model retrieval methods can be divided into two categories: model-based methods and
view-based methods. Early works are mainly model-based methods, which employ either low-level
feature-based methods [9,82,86] (e.g., the geometric moment [11], surface distribution [88],
volumetric descriptor [82,85], and surface geometry [15]-[18,87]) or high-level structure-based
methods [19]. Because they require explicit 3D model data, these methods are of limited use in
practical applications.

Extensive research efforts have been dedicated to view-based 3D model retrieval methods
[21,22] because multiple views are highly discriminative for 3D object representation
[61,62]. Many view-based 3D object retrieval methods, e.g., the Light Field Descriptor (LFD) [23],
Elevation Descriptor (ED) [24], Bag-of-Visual-Features (BoVF) [62], and Compact Multi-View
Descriptor (CMVD) [26], have been proposed in recent years. This is because view-based
methods [72] are highly discriminative for object representation, and because visual
analysis [64,71] plays an important role in multimedia applications.

The advantages of the view-based method are twofold. 

1) It does not require explicit virtual model information, which makes the method robust
in practical applications.

2) Image processing has been investigated for many decades, so view-based 3D model
analysis methods can benefit from existing image processing technologies.

In this technical report, we review related works and recent trends in view-based 3D model
retrieval, with a special focus on the multiple-view matching scheme and the application of
existing bag-of-words methods to 3D model retrieval.

II. Multiple View Matching

For view-based 3D model retrieval, each 3D model is represented by a group of 2D views. 
How to perform multiple view matching is the key topic in the view-based 3D model retrieval task.
In this section, we briefly review existing multiple view matching methods. 

Chen et al. first proposed the Light Field Descriptor (LFD) [23]. LFDs are computed from
10 silhouettes obtained from the vertices of a dodecahedron over a hemisphere. This image set
describes the spatial structure information from different views. In LFD, Zernike moments and
Fourier descriptors are employed as the features of each image. The best match between two
LFDs is taken as the similarity between two 3D models.

Ansary et al. [ ] introduced the Adaptive Views Clustering (AVC) method. In AVC, 320 initial
views are captured, and representative views are optimally selected by adaptive
view clustering with the Bayesian information criterion. A probabilistic method is then employed
to calculate the similarity between two 3D models, and the objects with high probability are
selected as the retrieval results. The method has two parameters, which are used to
modulate the probabilities of objects and views, respectively.

Daras et al. proposed the Compact Multi-View Descriptor (CMVD) [26], in which 18
characteristic views of each 3D model are first selected through the 18 vertices of the
corresponding bounding 32-hedron. In CMVD, both binary images and depth images are taken to
represent the views. The comparison between 3D models is then based on feature matching
between the selected views using 2D features, such as the 2D Polar-Fourier Transform, 2D Zernike
Moments, and 2D Krawtchouk Moments. Given a query object, the tested object is rotated to
find the direction that best matches the query. The minimal sum of view distances over the
candidate rotation directions is taken as the distance between the two objects.
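
The rotation search described above can be sketched as follows. This is only a minimal
illustration under stated assumptions, not the CMVD implementation: per-view features are plain
lists of floats, the per-view distance is assumed to be Euclidean, and candidate rotations are
modelled as permutations of view indices. The `rotations` argument is a hypothetical input that
a real system would derive from the symmetry of the view polyhedron.

```python
import math

def view_distance(a, b):
    """Euclidean distance between two per-view feature vectors (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def aligned_distance(query_views, target_views, rotations):
    """Minimum, over candidate rotations, of the summed per-view distance.

    Each rotation is a permutation p: query view i is compared with
    target view p[i]. The permutation list is a hypothetical input.
    """
    best = math.inf
    for perm in rotations:
        total = sum(view_distance(query_views[i], target_views[perm[i]])
                    for i in range(len(query_views)))
        best = min(best, total)
    return best

# Toy example: three views, with rotations modelled as cyclic shifts.
query = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
target = [[0.0, 1.0], [0.0, 0.0], [1.0, 0.0]]  # same views, shifted by one position
shifts = [[(i + s) % 3 for i in range(3)] for s in range(3)]
print(aligned_distance(query, target, shifts))  # 0.0, found at shift 1
```

Without the rotation search (identity permutation only), the same two view sets would appear
to be far apart, which is why the alignment step matters for view-based comparison.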

In [36], 7 representative views from three principal and four secondary directions were acquired
to index objects, and a contour-based feature was extracted from each view for multi-view matching.
In [56], query views were re-weighted using relevance feedback information through a
multi-bipartite graph reinforcement model. In this method, the weights of the query views were
generated by propagating information from the labelled retrieval results. To deal with the view
set of each 3D model, an incremental representative view selection method [79] has been proposed.
In this method, the representative views for the query model are selected by using user relevance
feedback, and a distance metric is learnt for each selected view. Gao et al. [83] proposed
updating the weights of representative views by using pseudo-relevance feedback information,
where a graph-based learning process is conducted to renew the weights of the query views.

Some methods employ generated views to represent 3D models. Panoramic object representation
for accurate model attributing (PANORAMA) [ ] employs panoramic views to
capture the position of the model's surface as well as its orientation as the 3D
model descriptor. The panoramic view of a 3D model is obtained by projecting the 3D model
onto the lateral surface of a cylinder aligned with one of the object's three principal axes and
centered at the centroid of the object.

Gao et al. proposed the Spatial Structure Circular Descriptor (SSCD) [10], which preserves
the global spatial structure of 3D models and is invariant to rotation and scaling. All spatial
information of a 3D model can be represented by an SSCD, which includes several SSCD images.
In SSCD, a minimal bounding sphere of the 3D model is computed, and all points on the 3D
model surface are projected onto the bounding sphere. An attribute value is attached to each
point to represent the surface spatial information. The bounding sphere is further projected onto
a circular region of a plane, which preserves the spatial structure of the original 3D model.
Each SSCD image uses this circular image to describe the surface information of the 3D
model, so each spatial part of the 3D model is represented by one part of the SSCD individually.
Histogram information of the SSCD images is employed as the feature to compare two 3D
models.

Shih et al. proposed an Elevation Descriptor (ED) [ ]. ED extracts six range views to describe 
the original 3D model from its bounding box. These views contain the altitude information of 
the 3D model from six directions. 3D models are compared based on the matching of EDs. To 
match two groups of EDs, the minimal distance with order is calculated to measure the distance 
between the two 3D models. 

Many existing distance measures have also been investigated in view-based 3D model retrieval,
e.g., the Hausdorff distance [57,59], the Earth Mover's Distance [58] and bipartite graph
matching [45]. Gao et al. [5] proposed a bipartite graph matching-based 3D model comparison
method, in which weighted bipartite graph matching (WBGM) is employed for the
comparison between two 3D models. In view-based 3D model retrieval, each 3D model is
represented by a set of 2D views. Representative views are first selected, and their
initial weights are assigned and further updated using the relationship among the representative
views. The weighted bipartite graph is built from these selected 2D views, and the proportional
max-weighted bipartite matching method [45] is employed to find the best match in the weighted
bipartite graph. Wen et al. [77] further extended the bipartite graph matching method for 3D
model matching: the constructed bipartite graph is first partitioned into subsets, and
the matching is conducted in each subset.
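
To make two of these set-to-set measures concrete, the sketch below computes the symmetric
Hausdorff distance and a one-to-one minimum-cost matching between two small sets of view
features. It is an illustration under assumptions rather than any of the cited methods: views
are plain feature vectors, the ground distance is assumed to be Euclidean, and an unweighted
brute-force matching stands in for the proportional max-weighted matching of [45], which is not
reproduced here.

```python
import math
from itertools import permutations

def euclidean(a, b):
    """Assumed ground distance between two view feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hausdorff(views_a, views_b, dist=euclidean):
    """Symmetric Hausdorff distance between two sets of view features."""
    def directed(src, dst):
        # Largest distance from a view in src to its nearest view in dst.
        return max(min(dist(s, d) for d in dst) for s in src)
    return max(directed(views_a, views_b), directed(views_b, views_a))

def min_cost_matching(views_a, views_b, dist=euclidean):
    """Cheapest one-to-one matching by brute force (small, equal-size sets only)."""
    n = len(views_a)
    if n != len(views_b):
        raise ValueError("this sketch needs view sets of equal size")
    best_cost, best_perm = math.inf, None
    for perm in permutations(range(n)):
        cost = sum(dist(views_a[i], views_b[perm[i]]) for i in range(n))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_cost, list(best_perm)

views_a = [[0.0, 0.0], [4.0, 0.0]]
views_b = [[0.0, 1.0], [4.0, 3.0]]
print(hausdorff(views_a, views_b))          # 3.0
print(min_cost_matching(views_a, views_b))  # (4.0, [0, 1])
```

The Hausdorff distance is driven by the single worst-matched view, whereas the matching cost
accumulates over all paired views. The brute-force search is factorial in the number of views,
so a practical system would use a polynomial-time assignment solver instead.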

Gao et al. [51] proposed a probabilistic framework to compare two 3D models. In this method,
for every query object, all query views are first clustered to generate view clusters, which are
then used to build the query Gaussian models. For more accurate 3D object comparison, a
positive matching model and a negative matching model are trained individually using positive
matched samples and negative matched samples, respectively. The CCFV model is generated
on the basis of the query Gaussian models by combining the positive matching model and the
negative matching model.

Graph-based semi-supervised learning [29] has been applied to both multimedia indexing
and retrieval tasks [40,49], and has shown its superiority in exploiting both labeled
and unlabeled data. Wang et al. [ ] employed graph-based semi-supervised learning for
multi-concept detection. A multi-graph based semi-supervised learning method is introduced in
[43] to fuse information from different sources by using a multi-graph framework. Towards a
diverse and relevant search of social images, graph-based semi-supervised learning has been
employed in [38] to learn the relevance scores of images with respect to the concept. A
semi-supervised kernel density estimation method is introduced in [42] to annotate videos. In
this method, the kernel density is estimated under the semi-supervised learning framework, and
the optimal video annotation is obtained by learning within this framework. Tang et al. [32]
proposed constructing correlative linear neighborhood connections in the graph structure to
learn the optimal relevance scores for video annotation.

Semi-supervised learning on the hypergraph structure [30] has been
investigated in many multimedia and computer vision tasks, e.g., image search [66,90,121,124].
A hypergraph-based 3D object representation method is presented in [27], which constructs the
hypergraph by using the correlation among different surface boundary segments of an object in
a CAD system. Yu et al. [31] introduced a hypergraph learning method for image classification.
In this method, the relevance scores among images are learned together with the weights
of the constructed hyperedges. A hypergraph learning-based social image search method is
proposed in [66]. A class-specific hypergraph (CSHG) [28] is proposed to integrate local SIFT
and global geometric constraints for object recognition, where the hypergraph is employed to
model a specific category of objects with multiple appearance instances. Gao et al. [13]
proposed formulating the relationship among different 3D models in a hypergraph structure. Each
view of a 3D model is treated as a feature of that 3D model. View clustering is first performed
to generate view groups, and each view group generates one hyperedge, which connects the 3D
models that have views in that view group. Semi-supervised learning
on the constructed hypergraph is then conducted to generate the relevance scores among 3D
models, which can be further used for 3D model retrieval.
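
The hyperedge construction just described can be sketched as a small incidence-matrix builder.
This is a minimal illustration, not the implementation of [13]: the view clustering is assumed
to have been done elsewhere, and the names `view_owner`, `view_cluster`, `build_incidence` and
`hyperedge_members` are hypothetical.

```python
def build_incidence(view_owner, view_cluster, num_models, num_clusters):
    """Return a num_models x num_clusters 0/1 incidence matrix H.

    view_owner[v]   : index of the 3D model that view v belongs to
    view_cluster[v] : index of the view group (cluster) assigned to view v
    H[m][e] is 1 when model m has at least one view in view group e,
    i.e. hyperedge e connects model m.
    """
    H = [[0] * num_clusters for _ in range(num_models)]
    for owner, cluster in zip(view_owner, view_cluster):
        H[owner][cluster] = 1
    return H

def hyperedge_members(H, e):
    """Models connected by hyperedge e."""
    return [m for m, row in enumerate(H) if row[e] == 1]

# Toy data: 3 models with 2 views each, grouped into 2 view clusters.
view_owner = [0, 0, 1, 1, 2, 2]
view_cluster = [0, 0, 0, 1, 1, 1]
H = build_incidence(view_owner, view_cluster, num_models=3, num_clusters=2)
print(H)                        # [[1, 0], [1, 1], [0, 1]]
print(hyperedge_members(H, 0))  # [0, 1]
print(hyperedge_members(H, 1))  # [1, 2]
```

In this toy case model 1 has views in both groups, so it belongs to both hyperedges and links
models 0 and 2 indirectly. A learning step on the hypergraph would then propagate relevance
along such shared memberships.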

III. Bag-of-Visual-Words and Its Application in 3D Model Retrieval

Generally, the local features [69] [70] extracted from images are quantized into a set of visual
words, and a visual word dictionary is created to generate an index file. Each image can then be
described by a Bag-of-Words histogram. This Bag-of-Words representation offers sufficient
robustness against photographic variations in occlusion, viewpoint, illumination, scale and
background. In applications of the Bag-of-Words method, the visual vocabulary
plays an important role in the whole algorithm. In this section, we first introduce
existing visual vocabulary training methods in detail, and then describe the application
of the Bag-of-Words method to 3D model retrieval.

A. Visual Vocabulary Training in the Bag-of-Words Method

Typically, building a visual vocabulary resorts to unsupervised vector quantization
[ ] [93] [67] [84] [94], which subdivides the local feature space into discrete regions, each
of which corresponds to a visual word. An image is represented as a Bag-of-Words (BoW) histogram,
where each word bin counts how many local features of this image fall in the corresponding
feature space partition of this word. To this end, many vector quantization schemes have been
proposed to build the visual vocabulary, such as K-means [95], Hierarchical K-means (Vocabulary
Tree) [92], Approximate K-means [93], and their variants [96] [117] [97] [93] [94] [98] [84] [99].
Meanwhile, hashing local features into a discrete set of bins that are subsequently indexed is an
alternative choice, for which methods like Locality Sensitive Hashing (LSH) [100], Kernelized LSH
[101], Spectral Hashing [102] and its variants [103] [104] are also exploited in the literature.
Visual word uncertainty and ambiguity are also investigated in [105] [93] [106] [107], using
methods such as Hamming Embedding [105], Soft Assignments [93] and Kernelized Codebook [107].
Other related directions include optimizing the initial inputs of visual vocabulary construction,
such as learning a better local descriptor detector as in [108]; devising a better similarity
metric, such as learning an optimal hashing-based distance matching as in [39] for human action
search [25]; and incorporating Bayesian reasoning into the similarity calculation [63]. Related
applications include image annotation, landmark search [48][34], text detection, and distributed
visual search [110].
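
The histogram construction described above can be sketched in a few lines. This is a minimal
illustration with hard assignment, assuming a vocabulary of cluster centres that has already
been trained (for example by K-means). The function names and the toy two-dimensional features
are hypothetical, and real local descriptors such as SIFT have many more dimensions.

```python
def nearest_word(feature, vocabulary):
    """Index of the visual word (cluster centre) closest to a local feature."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(vocabulary)), key=lambda k: sq_dist(feature, vocabulary[k]))

def bow_histogram(features, vocabulary):
    """Count how many local features fall into each word's partition."""
    hist = [0] * len(vocabulary)
    for f in features:
        hist[nearest_word(f, vocabulary)] += 1
    return hist

# Toy vocabulary of three visual words and four local features.
vocabulary = [[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]]
features = [[1.0, 0.5], [9.0, 11.0], [0.2, 0.1], [1.0, 9.0]]
print(bow_histogram(features, vocabulary))  # [2, 1, 1]
```

The linear scan over the vocabulary is the costly part for large dictionaries, which is the
motivation for the hierarchical and approximate quantizers mentioned above.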

Going beyond unsupervised vector quantization, semantic or category labels are also
exploited [109][123][111][112][115] to supervise the vocabulary construction, so that the learned
vocabulary is more suitable for the subsequent classifier training, e.g., images in the same
category are more likely to produce similar BoW histograms and vice versa. The quantization
issues in visual vocabulary have also recently been addressed to fit the city-scale landmark
search scenario, as in [84] [89]. More recently, the compression of the visual vocabulary
model has also received considerable research interest.

B. Application of the Bag-of-Words Method in 3D Model Retrieval

To apply the bag-of-visual-words method to view-based 3D model retrieval, SIFT features are
generally extracted from all selected views, and a visual word dictionary is learnt. A
bag-of-visual-words description is then generated to represent each 3D model. To perform 3D
model matching, the KL divergence or another distance metric can be employed. With the visual
words description, the detailed features of each view can be represented well. As a result,
this description is discriminative for 3D model representation.
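
As an illustration of the matching step, the sketch below compares two bag-of-visual-words
histograms with the KL divergence. It is written under assumptions that are not taken from the
surveyed papers: the histograms are raw word counts, zero bins are smoothed with a small
constant `eps` so that the logarithm stays finite, and a symmetrized variant is included because
the plain KL divergence is not symmetric.

```python
import math

def normalize(hist, eps=1e-9):
    """Turn counts into a probability distribution, smoothing zero bins (assumed)."""
    smoothed = [h + eps for h in hist]
    total = sum(smoothed)
    return [s / total for s in smoothed]

def kl_divergence(p_hist, q_hist, eps=1e-9):
    """KL(P || Q) between two bag-of-visual-words histograms."""
    p = normalize(p_hist, eps)
    q = normalize(q_hist, eps)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def symmetric_kl(p_hist, q_hist, eps=1e-9):
    """Sum of both directions, so that the measure does not depend on argument order."""
    return kl_divergence(p_hist, q_hist, eps) + kl_divergence(q_hist, p_hist, eps)

model_a = [2, 1, 1]
model_b = [2, 1, 1]
model_c = [0, 0, 4]
print(kl_divergence(model_a, model_b))  # 0.0 for identical histograms
print(symmetric_kl(model_a, model_c) > symmetric_kl(model_a, model_b))  # True
```

The choice of `eps` affects the scale of the result when bins are empty, so it should be kept
fixed across all comparisons within one retrieval run.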

Furuya and Ohbuchi have proposed a series of Bag-of-Visual-Words-based 3D model retrieval
methods. In [114], they first introduced the bag-of-visual-words method into view-based 3D model
retrieval by extracting SIFT features from views of 3D models. In this method, each
3D model is rendered into a group of depth images, and SIFT features are extracted from
these images. The bag-of-features approach is used to integrate the local features into
a feature vector for each model, and the distance between two 3D models is determined by
matching their feature vectors. Ohbuchi et al. [113] further proposed employing the
Kullback-Leibler divergence (KLD) to calculate the distance between two 3D models described by
bag-of-visual-features. Ohbuchi and Shimizu [62] employed a semi-supervised manifold learning
method for model class recognition. The proposed method projects the original feature space onto
a lower dimensional manifold. Relevance feedback information is then employed to capture the
semantic class information by using the manifold ranking algorithm. Though the
bag-of-visual-feature description is effective for 3D model retrieval, its computational cost is
high. Ohbuchi and Furuya [61] therefore accelerated the method by using a Graphics Processing
Unit. Furuya and Ohbuchi [116] proposed employing dense sampling to select feature points in the
views of 3D models, and a SIFT feature is then extracted at each point. These visual features are
further clustered into groups to generate the visual vocabulary, and a feature histogram is
generated to calculate the 3D model distance. This method has been further extended [116,125]
to deal with large-scale data. A distance metric learning method [122] has also been proposed to
learn the distance metric for matching 3D models.

Osada et al. [91] applied the bag-of-visual-feature method to the SHREC'08 CAD model
track task. A bag-of-region-words method [20] is introduced to extract visual features at the
region level. This method first selects points on a regular grid in each image, and local SIFT
features are extracted at these points. Each feature is then encoded into a visual word with a
pre-trained visual vocabulary. In this step, each view is split into a set of regions, and each
region is represented by a bag-of-visual-words feature vector. All the obtained regions are
further grouped into clusters based on the bag-of-visual-words feature, and one feature is chosen
from each cluster as its representative, with a corresponding weight. The Earth Mover's Distance
is used to measure the distance between two 3D models. Endoh et al. [120] proposed
conducting learning on the manifold structure of 3D models by using clustering-based training
sample reduction. Kawamura et al. [119] further employed geometrical features to improve
the feature-based method.

IV. Discussion and Future Work 

View-based 3D model retrieval has been actively investigated in recent years, and many methods
have been proposed. The key point for view-based 3D model retrieval lies in the
matching of multiple views. Many works employ many-to-many matching schemes, e.g., EMD
and bipartite graph matching, to measure the distance between two 3D models. Other works
employ a probabilistic framework to estimate the similarity between 3D models. With the
wide application of the bag-of-visual-words method in multimedia and computer vision tasks, it
has also been employed in view-based 3D model retrieval. We have introduced recent works
that use bag-of-visual-words in detail.

Although there has been significant progress in view-based 3D model retrieval, many
problems still require further investigation.

1) The description method for multiple views. Though many works have focused on view
representation, e.g., by using Zernike Moments and Bag-of-Visual-Words, a precise description
of multiple views remains important for 3D model retrieval.

2) Estimating the relationship among 3D models from multiple views. Most existing methods
employ many-to-many matching methods to calculate the distance between 3D models. A
few methods use a probabilistic method to formulate the relationship among 3D models.
As each 3D model is described by a group of views, the relationship among 3D models
is more complex than the relationship between just two images.